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Abstract 

Understanding the dynamics of neural networks is a major chal- 
lenge in experimental neuroscience. For that purpose, a modelling of 
the recorded activity that reproduces the main statistics of the data 
is required. In a first part, we present a review on recent results deal- 
ing with spike train statistics analysis using maximum entropy models 
(MaxEnt). Most of these studies have been focusing on modelling 
synchronous spike patterns, leaving aside the temporal dynamics of 
the neural activity. However, the maximum entropy principle can be 
generalized to the temporal case, leading to Markovian models where 
memory effects and time correlations in the dynamics are properly 
taken into account. In a second part, we present a new method based 
on Monte-Carlo sampling which is suited for the fitting of large-scale 
spatio-temporal MaxEnt models. The formalism and the tools pre- 
sented here will be essential to fit MaxEnt spatio-temporal models to 
large neural ensembles. 

1 Introduction 

The structure of the cortical activity, and its relevance to sensory stimuli 
or motor planning, have been subject to long standing debate. While some 
studies tend to demonstrate that the majority of the information conveyed 
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by neurons is contained in the mean firing rate j59], other works have shown 
evidence for a role of the higher order neural assemblies in neural coding 

ra, m m, m and 123). 

Many single cell studies have reported an irregular spiking activity which 
seems to be very close to a Poisson process; concluding that the activity spans 
a very large state space. Several studies claim that some specific patterns, 
called "cortical songs", appear in a recurrent fashion (|27j). but their existence 
is controversial ( |4Uj and |34J), suggesting that the size of the state space 
explored by the activity could be smaller than expected. This point requires 
an accurate description of the neural activity of populations of neurons Q66J, 
03], (32], [SO and [2]). 

These controversies partially originate from the fact that characterizing 
the statistics of the neural activity observed during the simultaneous record- 
ing of several neurons is challenging, since the number of possible patterns 
grows exponentially with the number of neurons. As a consequence, the prob- 
ability of each pattern cannot be reliably measured by empirical averaging, 
and an underlying model is necessary to reduce the number of variable to 
be estimated. To infer the whole state of the neural network, some attempts 
have been done to build a hidden dynamical model which would underlie 
the cortical responses of several recorded neurons. Most of the time, this 
approach has been used to characterize the activity of neurons during dif- 
ferent types of behaviour. Among others, Shenoy and colleagues |60j used 
a dynamical system to model the activities of multiple neurons recorded in 
the motor areas. Most of the time, in this approach, the number of neurons 
largely exceeds the number of parameters. The assumed low dimension of 
the underlying dynamical system is often due to the low dimension of the 
behavioural context itself. For example, in a task where a monkey is asked 
to make a choice between a small number of options (e.g. moving toward 
one target amongst several), one can expect that the features of the neural 
activity which are relevant to this task can be described with a number of 
parameters which is comparable to the number of possible actions. 

For more complex tasks or stimuli, the dimension of these models may 
have to be increased. This would be especially critical in the case of sensory 
networks stimulated with natural or complex stimuli. For this latter issue, a 
different strategy has been proposed by Schneidman et al \5U\ and Shlens et 
al [621 163] . Their purpose was to describe the statistics of the retinal activity 
in response to natural stimuli. They defined a set of values (mean firing rates, 
correlations...) that must be fitted, and then picked the least structured of 
the models that would satisfy these constraints. This approach, which will 
be described below, is based on maximum entropy models of the activity. It 
is interesting to point out that, while the previous approach aims at finding 
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a useful representation of the activity with the lowest dimension, the maxi- 
mum entropy approach picks the model with the highest dimension. 



In this paper, we first describe the challenge of modelling the statistics 
of the neural activity, and review the results that were obtained using max- 
imum entropy models. Many studies focused on modelling the synchronous 
patterns, putting aside the issue of modelling the temporal dynamics of the 
neural activity. We show why the extent of maximum entropy models to the 
temporal case raises specific issues, such as treating correctly memory and 
time correlations, and how they can be solved. The corresponding section 
(section [2j reviews the maximum entropy approach, and focuses on applying 
it to general spatio-temporal constraints. We also include a short discussion 
on other spatio-temporal approaches to spike train statistics such as the Gen- 
eralized Linear Model (HI ESI SSI LTS1 SHI SZl SI SB] - In a second part (section 
[3]) we present a new method, based on Monte-Carlo sampling, which is suited 
for the fitting of large scale spatio-temporal models. The section [4] provides 
examples and numerical tests in order to show how far we can go with the 
Monte-Carlo method and its performance. 

2 The maximum entropy principle 

In this section, we present the maximum entropy principle in a general set- 
ting. We first give a set of notations and definitions, then present a brief 
history of this principle in spike train analysis. Finally, we introduce a frame- 
work which allows the handling of general spatio-temporal constraints. 

2.1 Notations and definitions 
2.1.1 Spike trains 

We consider a network of N neurons. We assume there is a minimal time scale 
5, such that a neuron can fire at most one spike within a time window of size S. 
To each neuron k and discrete time n, we associate a spike variable: Wfc(n) = 1 
if neuron k fires at time n, and u>k{n) = otherwise. The state of the entire 

network in time bin n is thus described by a vector u(n) = f [ ujk{n) ]^ =1 , called 
a spiking pattern. 

A spike block, which describes the activity of the whole network between 
to moment of time ri\ and n 2 , is a finite ordered list of such vectors, written: 
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The range of a block is n 2 — ri\ + 1, the number of time steps from rt\ to n 2 . 
Here is an example of a spike block of range 5 with N = A neurons. 

(1 1 1 \ 
1 1 1 
1 1 
10 11/ 

A spike train or raster is a spike block cUq^ 1 from some initial time to 
some final time T — 1. To simplify notation we simply write u for a spike 
train. We note Q = { 0, 1 } NT the set of all possible spike trains. 

2.1.2 Observables 

We call observable a function: 

r 

0(u) = l[u ku (n u ), (1) 

u=l 

i.e. a product of binary spike events where k u is a neuron index and n u a 
time index, with u = l,...,r, for some integer r > 0. Typical choices of 
observables are oj^ijii) which is 1 if neuron k\ fires at time n\ and which 
is otherwise; io^{n\) cjfc 2 (n 2 ) which is 1 if neuron k\ fires at time n\ and 
neuron fc 2 fires at time n 2 and which is otherwise. Another example is 
^ki( n i) (1 ~~ UJ k 2 { n 2)) which is 1 is neuron k\ fires at time n\ and neuron k 2 
is silent at time n 2 . This example emphasizes that observables are able to 
consider events where some neurons are silent. 

We say that an observable O has range R if it depends on R consecutive 
spike patterns, e.g. 0(u) = O^q^ 1 ). We consider here that observables do 
not depend explicitly on time (time-translation invariance of observables). 
As a consequence, for any time n, O^^ 1 ) = O(oo™ +R ~ l ) whenever Uq' 1 = 

, .n+R-l 

2.1.3 Spike train statistics 

It is common in the study of spike trains to attempt to detect some statistical 
regularity. Spike trains statistics is assumed to be summarized by a hidden 
probability \i characterizing the probability of spatio-temporal spike patterns: 
fj, is defined as soon as the probability /x [co™* ] of any block w™^ is known. 
We assume that /x is time-translation invariant: for any time n, /x [cOq^ 1 ] = 
/x [a;™^ -1 ] , whenever uj^~ 1 = 

Equivalently, /x allows the computation of the average of the observables. 
We note fx[0] the average of the observable O under /x. If 0(u) = ^(ni) 
then /x [ O ] is the firing rate of neuron k\ (it does not depend on rt\ from the 
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time-translation invariance hypothesis); if O = (^(ni) Uk 2 {ri2), then /i [O] is 
the probability that neurons k\ and k 2 fire over the time span n% — n\. Addi- 
tionally, n [ 0;^! (0) Uk 2 (0) ] — fi [ LOfa (0) ] /i [ Uk 2 (0) ] represents the instantaneous 
pairwise correlation between the neurons k\ and k 2 - 

There are several methods which allow the computation or estimation 
of \i. In the following we shall assume that neural activity is described by 
a Markov process with memory depth D and positive time-translation in- 
variant transition probabilities P [tu(D) | ] > 0. From the assumption 
P [cu(D) | cOq^ 1 ] > 0, this chain has a unique invariant probability ft such 
that, for any n > D, and any block cUq : 

n-D 

n p [^ +/ ) k+^m^ 1 ]- (2) 

1=0 

Therefore, knowing the transition probabilities (corresponding to blocks cJq 
of range D + 1) and fi (which can be determined as well from the transition 



probabilities as exposed in section 2.3.1 ), the probability of larger blocks can 



be computed. Equation Q makes explicit the role of memory in statistics of 
spike blocks, via the product of transition probabilities and the probability 
of the initial block [i [cu® _1 ] . 

On the opposite, if D — 0, the probability to have the spike pattern oo(D) 
does not depend on the past activity of the network (memory-less case). In 
this case P [u)(D + /) | uof +l ~ l ] becomes /i [co(l) } and the probability (p| of 
a block becomes: 

n 

A*K]=nA*[w(01- (3) 

1=0 

Therefore, in the memory-less case, spikes occurring at different times are 
independent. 

This emphasizes the deep difference between the case D = and the case 
D > 0. 



2.1.4 Empirical average 

Let us assume that we are given an experimental raster of length T, such that 
uJq~ x . The estimation of spikes statistics has to be done on this sample. In 
the context of the maximum entropy principle, where statistics is assumed to 
be time translation invariant, statistics of events is obtained via time-average. 
The time-average or empirical average of an observable O in a raster u of 
length T is denoted by ir^ [O]. For example, if O = ^(0) the time-average 
7Tw [O] = ^-j- J2 n =o u k( n ) is the firing rate of neuron k, estimated on the 
experimental raster uj. 
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The empirical average is a random variable, depending on the raster 00, 
as well as on the time length of the sample and it has Gaussian fluctuations 
whose amplitude tends to as T — > +00 like -4=. This is the case, e.g. for 
the empirical averages obtained from several spike train acquired with several 
repetition. 



2.1.5 Complexity of the set of spike blocks. 

If one has N neurons and wants to consider spike block events within R 
time steps, one has 2^ possible states. For a reasonable Multi Electrodes 
Array (MEA) sample, N = 100, R = 3 (for a time lag of 30ms with a 10ms 
binning), this gives 2 300 ~ 4 x 10 180 , which is quite a bit more than the 
expected number of particles in the (visible) universe. Taking into account 
the huge number of states in the set of blocks, it is clear that any method 
requiring the extensive description of the state space will fail as NR grows. 
Additionally, while the accessible state space is huge, the observed state space 
(e.g. in an experimental raster) is rather small. For example, in a MEA 
raster for a retina experiment, the sample size T is about 10 6 — 10 7 , which 
is quite a bit less than 2^. As a matter of fact, any reasonable estimation 
method must take this small-sample constraint into account. As we show in 
the next section, the maximum entropy principle and the related notion of 
Gibbs distributions allows us to take these aspects into account. 



2.2 The maximum entropy principle 
2.2.1 Motivations 



Following |2.1| the goal is to find a probability distribution \i such that: 

• fi is inferred from an empirical raster u, by computing the empirical 
average of a set of ad hoc observables Ok, k — 1, . . . , K. One asks that 
the average of Ok with respect to fi satisfies: 

fi[O k ]=nP[O k ], k = l,...,K. (4) 

The mean of Ok, predicted by /i is equal to the mean computed on the 
experimental raster, /i is called a "model" in the sequel. The set of 
observables Ok defines the model. 

• /i has to be "as simple as possible", with the least structure and a 
minimum number of tunable parameters. In the maximum entropy 
paradigm |28j these issue are (partly) solved by seeking a probability 
distribution /i which maximizes the entropy under the constraints 
The entropy is defined explicitly below (see eq. ([5]), (21)). 
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• From the knowledge of /x one can compute the probability of arbitrary 
blocks (e.g. via eq. ([2])) and the average of other observables than the 
O k s. 

Remark. 

Assume that we want to select observables Ok in the set of all possible 
observables with range R. For N neurons there are 2^ possible choices. 
When NR increases, the number of possible observables will quickly exceed 
the number of samples available in the recording. Including all of them in 
the model would overfit the data. Therefore, one has to guide the choice of 
observables by additional criteria. We now review some of the criteria, which 
have been used by other authors. 



2.2.2 Spatial models 

In a seminal paper, Schneidman et al [56] aimed at unravelling the role of 
instantaneous pairwise correlations in retina spike trains. Although these 
correlations are weak, researchers investigated whether they play a more 
significant role in spike train statistics than firing rates. 

Firing rates correspond to the average of observables of the form Ui(0),i = 
1 , . . . , N (the time index comes from the assumed time-translation invari- 
ance) while instantaneous pairwise correlations correspond to averages of 
observables of the form Ui(0)ujj(0), 1 < i < j < N. Analysing the role of pair- 
wise correlations in spike train statistics, compared to firing rate, amounts 
therefore to comparing two models, defined by two different types of observ- 
ables. 

Note that all of these observables correspond to spatial events occurring 
at the same time. They give no information on how the spike patterns at 
a given time depend on the past activity. This situation corresponds to a 



memory-less model (D = in section 2.1.3), where transition probabilities do 



not depend on the past. As a consequence the sought probability /x weights 
blocks of range 1, and the probability of blocks with larger range is given 
by (3): spike patterns at successive time steps are independent in spatial 
models. 

In this case, the entropy of /x is given by: 

5(Ai) = -J3A*[w(0)]tog/i[w(0)]. (5) 

w(0) 

The natural log could be replaced by the logarithm in base 2. 
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Now, the maximum entropy principle of Jaynes |28] corresponds to seek- 
ing a probability [i which maximizes S{p) under the constraints (|4]). It can 
be shown (see section 2.3 for the general statement) that this maximization 



problem is equivalent to the following Lagrange problem: maximising the 
quantity S(fi) + /i [Up] where Up, called a potential is given by: 

K 

Up = J2^0 k . (6) 

k=l 

The /3fcS are real numbers and free parameters. \i [Up] is the average of 
Up with respect to //. Since Up is a linear combination of observables we 
have [j, [Up] = J2k=i Pk^ [ Ok ]■ If the O^s have finite range and are > — oo 



and if the (3kS are finite then it can be shown (see section 2.3) that there is 
only one probability //, depending on the /3fcS, which solves the maximization 
problem. It is called a Gibbs distribution. 
In this context (D = 0) it reads: 

A*["(0)] = —=—, (7) 

Z /3 



where the normalization factor 



e 



W/s(w(0)) 



w(0) 

is the so-called partition junction. 

To match Q the parameters have to be tuned. This can be done 
thanks to the following property of Zp\ 

ri^.^. (9, 

Thus the [3kS have to be tuned so that \i matches Q as well as ([9]). It turns 
out that log Zp is convex with respect to /3&S, so the problem has a unique 
solution. 

Note that log Zp does not only allow us to obtain the averages of the ob- 
servables, it also allows to characterize fluctuations. If a raster is distributed 
according to the Gibbs distribution ([7]), then, as pointed out in section 2.1.4 



the empirical average of an observable has fluctuations. One can show that 
these fluctuations are Gaussian (Central limit theorem). The joint probabil- 
ity of tt^ [Ok], k = 1, . . . , K is Gaussian, with mean // [ O k ] given by (9 ) 
and covariance matrix p where the matrix S has entries: 

_ <9 2 log^ 

Ekl ~ -mm- (10) 
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Let us now discuss what this principle gives in the two cases considered 
by Schneidman et al. 



(i) Only firing rates are constrained. Then: 

TV 
k=l 

It can be shown that the corresponding probability fi is: 



c)]=nt 



Thus, the corresponding statistical model is such that spikes emitted by 
distinct neurons at the same time are independent. The parameter j3k 
is directly related to the firing rate since = fi [wjfe(O) = 1 ] = ^_l$ k , 
so that we have: 



n N 
1=0 k=l 

the classical probability of coin tossing with independent probabilities 
(Bernoulli model). Thus, fixing only the rates as constraints, the max- 
imum entropy principle leads to analyze spike statistics as if each spike 
were thrown randomly and independently, as with coin tossing. This is 
the "most random model", which has the advantage of making as few 
hypothesis as possible. However, when only constrained with mean fir- 
ing rates, the prediction of even small spike blocks in the retina was not 
successful. This was expected since this model assumes independence 
between neurons, an assumption that has been proven wrong in earlier 
studies (e.g. [51J). 

(ii) Firing rates and pairwise correlations are constrained. In the 

second model, Schneidman et al constrained the maximum entropy 
model with both mean firing rates and instantaneous pairwise corre- 
lations between neurons. In this case, 

N N 

H p (u(0)) = ^/5 fc u; fc (0) + A«w*(0)wi(0). 

fc=l k,l=l 

Here the potential can be identified with the Hamiltonian of a mag- 
netic system with binary spins. It is thus often called "Ising model" 
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in the spike train analysis literature, although the original Ising model 
has constant and positive couplings |22j . The corresponding statistical 
model is the least structured model respecting these first and second 
order pairwise instantaneous constraints. The number of parameters is 
of the ordei^of iV 2 , to be compared with the 2 N possible patterns. 

Schneidman et al showed that the Ising model model successfully predicts 
spatial patterns, a result which was confirmed by [62] (see |JT] for a review). 
Other works have used the same method and found also a good prediction 
in cortical structure in vitro [70], in the visual cortex in vivo [80] . Later on, 
several authors considered higher order terms still corresponding to D = 
([42, 56, EU dH]). Note that these results have been obtained on relatively 
small subsets of neurons (usually groups of 10). An interesting challenge 
is to test how these results hold for larger subsets of neurons, and if other 
constraints have to be added [20j(Tkacik et al, in preparation). 

2.2.3 One time step spatio-temporal models and detailed balance 

These models are only designed to predict the occurrence of "spatial" pat- 
terns, lying within one time bin. The use of spatial observables naturally 
leads to a time independence assumption where the probability of occur- 
rence of a spatio-temporal pattern is given by the product ([3]). Tang et al. 
|70| tried to predict the temporal statistics of the neural activity with such 
a model and showed that it does not give a faithful description of temporal 
statistics. The idea to consider spatio-temporal observables then naturally 
emerges with the problem of generalising the probability eq. ([7]) to that case. 

From the statistical mechanics point of view, a natural extension consists 
of considering the space of rasters O as a lattice where one dimension is 
"space" (neurons index) and the other is time. The idea is then to consider a 
potential still of the form ^ but where the observables correspond to spatio- 
temporal events. We assume that Hp has range R = D + 1, < D < +oo. 
The potential of a spike block Uq , n > D is: 

n-D 

^K") = E^K +Z ) ( n ) 

1=0 

On this basis, restricting to the case where D = 1 (one time step memory 
depth) Marre et al have proposed in [37] to construct a Markov chain, where 
transition probabilities P[u(l + 1) \u)(l) } are proportional to e W/3 ^ + \ If 

Most approaches assumes moreover that the pairwise coefficients are symmetric /3ki = 
ftik which divides the number of parameters by 2. 
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/x is the invariant probability of that chain, the application of ([2]) leads to 

probability of blocks /j [ojq ], proportional to e fi < 3 ( w o )• the probability of a 
block is proportional to the exponential of its potential ("energy"). This 
approach is therefore quite natural from the statistical mechanics point of 
view. 

The main problem, however, is "what is the proportionality coefficient 
?" As shown in |37j . the normalization of conditional probabilities does not 
reduce to the mere division by a constant partition function. This normal- 
ization factor is itself dependent on the past activity. 

To overcome this dependency, Marre et al assumed that the activity re- 
spected a detailed balance. In this particular case, it can be shown that the 
normalization factor becomes, again, a constant. But this is an important 
reduction that could have implications for the interpretation of the data: for 
example, with this simplification, it is not possible to give an account of 
assymetric cross-correlograms. 

2.3 General spatio-temporal models 

We now present the general formalism which allows to solve the variational 
problem "maximising entropy under spatio-temporal constraints". This ap- 
proach is rigorous and the normalization problem is resolved without requir- 
ing additional assumptions such as detailed balance. At the end of this sec- 
tion, we briefly discuss other approaches considering spatio-temporal statis- 
tics and their relations to potentials of the form ([6]). 

2.3.1 Constructing the Markov Chain 

In this section we show how one can generate a Markov chain where transition 
probabilities are proportional to e^P&i \ for a potential Hp corresponding 
to spatio-temporal events. We also solve the normalization problem. This 
construction is well known and is based on the so-called transfer matrix (see 
e.g. |22j for a presentation in the context of statistical physics; [JB] for a 
presentation in the context of ergodic theory and [75] for a presentation in 
the context of spike trains analysis). 

This matrix is constructed as follows. Consider two spike blocks W\,W2 of 
range D > 1. The transition w\ — > W2 is legal if w\ has the form ^(O)^^ 1 
and W2 has the form u®~ l u(D). The vectors u(0),u(D) are arbitrary but 
the block cuf" 1 is common. Here is an example of a legal transition : 

wi= [I ? \ };w 2 = [I I I]. 
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Here is an example of a forbidden transition 

u* = [s ? i];«*= [S i o]- 

Any block cu^ of range i? = £) + 1 can be viewed as a legal transition 
from the block Wi = oo^ 1 to the block W2 = and in this case we write 

UJq ~ WiW 2 - 



The transfer matrix C is defined as: 

J e^P&a) if W\,W2 is legal with w^ 1 ~ Wiw 2 / 19 \ 
™ ~ \ 0, otherwise. ' 1 j 

From the matrix C the transition matrix of a Markov chain can be con- 
structed, as we now show. Since observables are assumed to be bounded 
from below, T-L^uJq) > — 00, thus e W/3 ^° ' > for each legal transition. As 
a consequence of the Perron- Frobenius theorem [2"Tj 157] , C has a unique real 
positive eigenvalue Sp, strictly larger than the modulus of the other eigen- 
values (with a positive gap), and with right, R, and left, L, eigenvectors: 
CR = SpR, LC = SpL, or, equivalentljj^J 

£ eM"*)R( u ?) = 8 fi R(«#- 1 ); 
u(D)e{o,i} N 

L(u^)eM<)= S pL(u°). 

These eigenvectors have strictly positive entries R(.) > 0, L ( . ) > 0, 
functions of blocks of range D. They can be chosen so that the scalar product 
(L,R) — 1. We define: 

V(Hp) = log sp. (13) 
called "topological pressure". We discuss the origin of this term and its 



properties in the section 2.3.2 



2 The right eigenvector R has 2 nd entries R w corresponding to blocks of range D. It 
obeys ^2 W2 C WlW2 R W2 = spR Wl , where w\ = loq^ 1 and where the sum runs over blocks 
W2 — wf- Since C W1W2 is non zero only if the entries u>i,u>2 have the block wf- 1 in 
common, and since the right hand side (spR Wl ) fixes the value of wi, this sum holds in 
fact on all possible values of u)(D). The notation R w , although natural, does not make 
explicit the block involved. This is problematic when one wants to handle equations such 



as (19 1. As a consequence, we prefer to use the notation R (block) to make explicit this 



dependence. The same remark holds mutatis mutandis for the left eigenvector. 
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To define a Markov chain from the transfer matrix C (eq. [12] ) we intro- 
duce the normalised potential: 

<P(u°) = U f3 (u°)-gp(u°) (14) 

with: 

Gp{u>$) = log R ( cut 1 ) - log R ( ) + log S/3 , (15) 
and a family of transition probabilities: 

P [00(D) l^- 1 ] d J:VK D ) >0 . (16) 

These transition probabilities define a Markov chain which admits a 
unique invariant probability: 

^- 1 )=R{^- l )L{ut 1 ). (17) 

From the general form of block probabilities (|2| the probability of blocks 
of depth n > D is, in this case : 

M^He^^^VK' 1 ]- (18) 



thus, from (17), (14), (15): 



P M w o) 

l*[uo] = ~^dTtR ( <-d + i ) L ( c^ 1 ) , (19) 

S /3 



where Hp (ujq) is given by (11) 



2.3.2 Remarks 

1. We have been able to compute the probability of any blocks Uq. It is 
proportional to e^i"") and the proportionality factor has been com- 
puted. In the general case of spatio-temporal events, it depends on 
ut 1 and u£_ D+1 . 

The same arises in statistical mechanics when dealing with bound- 



ary conditions. The forms (18), (19), remind Gibbs distributions on 



spin lattices, with lattice translation-invariant probability distributions 
given specific boundary conditions. Given a spin-potential of spatial- 
range n, the probability of a spin block depends upon the state of the 
spin block, as well as spins states in a neighbourhood of that block. The 
conditional probability of this block given a fixed neighbourhood is the 
exponential of the energy characterizing physical interactions, within 
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the block, as well as interactions with the boundaries. In (18), spins 
are replaced by spiking patterns; space is replaced by time. Spatial 
boundary conditions are here replaced by the dependence upon U!q~ 1 



and u%_ D+1 . 



As a consequence, as soon as one is dealing with spatio-temporal events 
the normalization of conditional probabilities does not reduce to the 
mere division by: 

Z n = ^2eM u o) t (20) 



as easily checked in (19). 



2. The topological pressure obeys nevertheless: 

V{Hp) = lim - logZ n , 

and is analogue to a thermodynamic potential density (free energy, free 
enthalpy, pressure). This analogy is also clear in the variational princi- 



ple (23) below. To our best knowledge the term "topological pressure" 
has its roots in the thermodynamic formalism of hyperbolic (chaotic) 
maps |55l l4"6l [5]. In this context, this function can be computed as the 
grand potential of the grand canonical ensemble, as a cycle expansion 
over unstable periodic orbits. It is therefore equivalent to a pressure^] 
depending on topological properties (periodic orbits). 

3. In the case D = the Gibbs distribution reduces to (JTJ). One can 
indeed easily show that: 

w(0) 

where Zp is the partition function Additionally, since spike pat- 
terns occurring at distinct time are independent in the D = case, Z n 
in (20) can be written as Z n = Zp so that V(Jip) = log Zp. 



4. In the general case of spatio-temporal constraints, the normalization 
requires the consideration of normalizing function Qp depending as well 
on the blocks uJq . Thus, in addition to function Hp normalization in- 
troduces a second function of spike blocks. This increases consequently 
the complexity of Gibbs potentials and Gibbs distributions compared 
to the spatial (D = 0) case where Qp reduces to a constant. 



3 The grand potential $ obeys $ = —PV, where P is the physical pressure and V the 
volume. Therefore, the grand potential density is (minus) the pressure. 
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2.3.3 The maximum entropy principle 



We now show that the probability distribution defined this way solves the 
variational problem "maximising entropy under constraints". 
We define the entropy rate (or Kolmogorov- Sinai entropy): 

h[fi] = - limsup— J— ^["0 1 log/i[otf] , (21) 

where the sum holds over all possible blocks Uq. Note, that in the case of a 
Markov chain h[fi] also reads [17J: 

h[fi] = -$>K] P[c^D) logP[cj(D) Iwf 1 ], (22) 



.D 

"(I 



whereas, when D — 0, ft. [//] reduces to the definition ([5]). 

As a general result from ergodic theory [55, [30j [13] and mathematical 
statistical physics [22], there is a uniqud^ probability distribution fj, such 
that IE3 ESSES]: 

V[Up\= sup (/i[i/] + v[Up)\) = h[n\ + M^/a], (23) 



where V [ "H/3 ] is given by ( 13 ) . Aii nv is the set of all possible time-translation 
invariant probabilities on the set of rasters with N neurons and v\Hp\ = 
^2 u d TL^{oJq) is the average value of Hp with respect to the probability 

v. 



Looking at the second equality, the variational principle (23) selects, 
among all possible probabilities u, a unique one realizing the supremum. This 
is exactly the invariant distribution of the Markov chain and is the sought 



Gibbs distribution. It is clear from (23) that the topological pressure is the 



formal analogue to a thermodynamic potential density, where Hp somewhat 
fixes the "ensemble": v [Hp] = $^fc=i Pk^ [Ok] plays the role of /3E (canoni- 
cal ensemble), /3E — fj,N (grand canonical ensemble), ... in thermodynamics 



2.3.4 Inferring the parameters 

The inverse problem of finding the values of /3feS from the observables average 
measured on the data is a hard problem with no exact analytical solution. 

4 The result is straightforward here since we consider bounded potentials with finite 
range. 
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However, in the context of spatial models with pairwise interactions the wis- 
dom, coming from statistical physics and especially Ising model and spin- 
glasses, as well as from the Boltzmann machine learning community, can be 
used. As a consequence, in this context, several strategies were proposed. 
Ackley et al [2] proposed a technique to estimate the parameters of a Boltz- 
mann machine. This technique is effective for small networks but it is time 
consuming. In practice, the time necessary to learn the parameters increases 
exponentially with the number of units. To speed up the parameters estima- 
tion, analytical approximations of the inverse problem have been proposed, 
which express the parameters /3& as a nonlinear function of the correlations 
of the activity (see for example [69]. [52] . [58]. [3]. [2]. [26]. ggj). 

These methods do not give an exact result, but are computationally 
fast. We do not pretend to review all of them here, but we quote a few 
prominent examples. In [58j, Sessak and Monasson proposed a system- 
atic small-correlation expansion to solve the inverse Ising problem. They 
were able to compute couplings up to the third order in the correlations for 
generic magnetizations, and to the seventh order in the case of zero magne- 
tizations. Their resulting expansion outperforms existing algorithms on the 
Sherrington-Kirkpatrick spin-glass model [61J. 

Based on a high-field expansion of the Ising thermodynamic potential, 
Cocco et al [15J designed an algorithm to calculate the parameters in a time 
polynomial with N, where the couplings are expressed as a weighted sum 
over the power of the correlations. They did not obtained a closed analytical 
expression, but their algorithm could run in a time that was polynomial in 
the number of neurons. 

Other methods, based on Thouless-Anderson-Palmer equations |72j and 
linear response |29j, or information geometry [69], initially proposed in the 
field of spin-glasses, have been adapted and applied to spike train analysis 
(see e.g. the work done by Roudi and collaborators [53]). 

The success of these approximations depends on the dataset, and there 
is no a priori guarantee about their efficiency at finding the right values of 
the parameters. However, by getting closer to the correct solution, they can 
potentially speed up the convergence of the learning by starting with a seed 
much closer to the real solution than if taking a random starting point. 

Note also that all the techniques mentioned above have been designed for 
the case where there is no temporal interaction (except [T5l [53] which are 



discussed in the section 2.3.5). Now, we explain how the parameters estima- 



tion can be done in the spatio-temporal models. 

In the general case parameters /3&S can be determined thanks to the fol- 
lowing properties. 
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V [Hp] is a log generating function of cumulants. First: 

&P[Hf>\ 



fi[O k ]. (24) 



This is an extension of ^ to the time- dependent case 



Second: 



(25) 



where Co k o t i n ) is the correlation function between the two observables 
Ok and 0\ at time n. Note that correlation functions decay exponen- 
tially fast whenever Hp has finite range. So that Yln=o Co k o x ( n ) < 
+00. 



Eq. (25) characterizes the variation in the average value of Of- when 
varying fi\ (linear response). The corresponding matrix is a susceptibil- 
ity matrix. It controls the Gaussian fluctuations of observables around 
their mean (central limit theorem) |5"5l I4"fi"l IT3] . This is the general- 



ization of (10) to the time dependent case. As a particular case, the 
fluctuations of the empirical average [Of.] of Ok around its mean 
fi[Ok] are Gaussian with a mean-square deviation y/^° k ^_ ^° k \ 

It is clear that the structure of the linear response in the case of spatio- 
temporal constraints is quite a bit more complex than the case D = 



see eq. (10)). Actually, for D — 0, all correlations Co k Oi( n ) vanish 



for n > (distinct times are independent). 

V(%p) is a convex function of (3. As a consequence, if there is a set of 
(3 value, (3*, such that . 

d ^^=riO k ]=C k , (26) 

then this set is unique. Thus, the solution of the variational problem 



(23 ) is unique. 



Basically, eq. (24), (25), 26, tell us that techniques based on free en- 
ergy expansion in spatial models can be extended as well to spatio-temporal 
cases, where the free energy is replaced by the topological pressure. Obvi- 
ously, estimating (not to speak of computing) the topological pressure can 
be a formidable task. Although, the transfer matrix technique allows the 
computation of the topological pressure, the use of this method for large N 
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is hopeless (see section 3.1). However, techniques based on periodic orbit 
expansion and zeta functions, could be useful |46]. Additionally, cumulant 



expansions of the pressure, eq. (24) and (25) corresponding to the two first 
orders, suggest that extension of methods based on free energy expansion 
could be used. In addition to the works quoted above, we also think of 
constraint satisfaction problems by Mezard and Mora |39j and approaches 
based on Bethe free energy [79]. Finally, as we checked, the properties of 
spatio-temporal Gibbs distributions allows to extend the parameters esti- 
mation methods developed for the spatial case in [T8"| [S] to spatio-temporal 
distributions (to be published). 



2.3.5 Other spatio-temporal models 

Here we shortly review alternative spatio-temporal models. We essentially 
refer to approaches attempting to construct a Markov chain and related in- 
variant probability by proposing a specific form for the transition probabili- 
ties. 

A prominent example is provided by the so-called Linear-Nonlinear (LN) 
models and Generalized Linear Models (GLM) [381 SSJ [TSl SHI SZl SI SH] - 
Shortly, the idea is to model spike statistics by a point process where the 
instantaneous firing rate of a neuron is a nonlinear function of the past 
network activity including feedbacks and interaction between neurons [64J. 
This model has been applied in a wide variety of experimental settings 
H2 EH El SH [73 |UJ. Typically, referring e.g. to [4j, the rate n has 
the form (adapting to our notations): 



./ 



h + Ki.x + )H ij 



(27) 



where the kernel represents the z-th cell's linear receptive field and x 
an input. characterizes the effect of spikes emitted in the past by pre- 
synaptic neuron j on post-synaptic neuron i. In this approach, neurons are 
assumed to be conditionally-independent given the past. The probability 
to have a given spike-response to a stimulus, given the past activity of the 
network, reads as the product of firing rates (see e.g. eq. 2.4 in |4j). 

In |4] the authors use several Monte Carlo approaches to learn the param- 
eters of the model for a Bayesian decoding of the rasters. Comparing to the 
method presented in the previous sections, the main advantages of the GLM 
is: (i) The transition probability is known (postulated) from the beginning 
and does not require the heavy normalization imposed by potentials of the 
form (loT); (ii) The model parameters have a neurophysiological interpretation, 
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and their number grows at most as a power law in the number of neurons, 
as opposite to ([6]), where the parameters are delicate to interpret and whose 
number can become quite large, depending on the set of constraints. 



Note however that a model of the form (27) can be written as well in the 
form ([6]): this is a straightforward consequence of the Hammersley-Clifford 
theorem [23]. The parameters /3k in ^ are then nonlinear functions of the 
parameters in (27) (see [TT] for an example). 



The main drawback of this approach is the assumption of conditional 
independence between neurons: neurons are assumed independent at time t 



when the past, which appears in the function Hij in (27), is given, and the 
probability of a spiking pattern at time t is the product of neurons firing 
rates. On the opposite, the maximal entropy principle does not require this 
assumption. 

It is interesting to remark that the conditional independence assumption 
can be rigorously justified in conductance based Integrate and Fire models 
[TT| [T2] and the form of the function / can be explicitly found (this a sigmoid 
function instead of an exponential as usually postulated in GLM). This result 
holds true if only chemical synapses are involved (this is also implicit in the 



kernel form in (27) [4|), but conditional independence breaks down for 
example as soon as electric synapses (gap junctions) are involved: this can 
be mathematically shown in conductance based Integrate and Fire models 
|16j . Note that in this large part of correlations is due dynamical 

interactions between neurons: as a consequence they persist even if there is 
no shared input. 

Recently, Macke et al. [36] extended the GLM model to fix the lack 
of instantaneous correlations between neurons in the GLM. They added a 
common input function that has a linear temporal dynamics. However, one of 
the disadvantages of this technique is that its likelihood is not unimodal, and 
thus that computationally expensive Expectation-Maximization algorithms 
have to be used to fit parameters. 

The GLM model is usually used to model both the stimulus-response 
dependence as well as the interaction between neurons, while the MaxEnt 
models usually focus on the latter (but see [73J). 



To finish this subsection, we would like to quote two important works 
dealing with spatio-temporal events too. First, In [15] Cocco and co-workers 
consider retinal ganglion cells spiking activity with a dual approach: on one 
hand they consider an Ising model (and higher order spatial terms) where 
they propose an inverse method based on a cluster expansion to find ef- 
ficiently the coupling in Ising model from data; on the other hand, they 
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consider the problem of finding the parameters (synaptic couplings) in a In- 
tegrate and Fire model with noise, from its spike trains. In the weak noise 
limit the conditional probability of a spiking pattern given the past is given 
by a least action principle. This probability is a Gibbs distribution whose 
normalized potential is characterized by the action computed over an optimal 
path. This second approach allows the characterization of spatio temporal 
events. Especially it gives a very good fit of the cross-correlograms. 

Second, in [53j, the authors consider a one step memory Markov chain 
where the conditional probability has a time-dependent potential of Ising 
type. Adapting a Thouless-Anderson-Palmer [72] approach used formerly 
in the Sherrington-Kirkpatrick mean-field model of spin glasses [61] they 
propose an inversion algorithm to find the model-parameters. As in the 
GLM their model assumes conditional independence given the past (see eq. 
(1) in [53]). 



2.4 Comparing models 



Solving equations (26) provides an optimal choice for the Gibbs distribution 
fi, given the observables Ok- However, changing the set of observables pro- 
vides distinct Gibbs distributions, which does not approximate the hidden 
probability with the same accuracy. We need here a way to quantify the 
"distance" between the "model" (the Gibbs distribution fixed by the set of 
observables) and the exact, hidden, probability f/-*\ Here are several criteria 
of comparison. 



2.4.1 Kullback-Leibler divergence 

The Kullback-Leibler divergence between fx, ii^*' is given by: 



d{^*\ n) = lim sup -— - ^ [ < ] lo § 



(2f 



which provides some notion of asymmetric "distance" between \x and [i^*>. The 
KL divergence accounts for discrepancy between the predicted probability 
fi [uJq ] and the exact probability fj^*> [uJq] for all blocks of range n. 

This quantity is not numerically computable from (28). However, for fi 
a Gibbs distribution and //<*> a time-translation invariant probability, the 
following holds: 
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The topological pressure V [Hp] is given by (13) while [Hp] is esti- 
mated by vr£ T) [Up] = Ylk=i PhFu [Ok] = Ylk=i PkC k - 

Since /i w is unknown, h(fi^) is unknown, and can only be estimated 
from data, i.e. one estimates the entropy of the empirical probability, h(iTu ). 
There exist efficient methods for that. Note that the entropy of a Markov 
chain is readily given by eq. (22), so the entropy h(ir^) is obtained by 



replacing the exact probability P in eq. (|22j), by the empirical probability 
h(-n { J ] ). As T -> +00, h(-n { J ] ) -> /i(/i W ), at exponential rateQ For finite T 
finite size corrections exist, see e.g. Strong et al. [HE]- In figure [T] is plotted 
an example. For a potential "H^ with N = 5 neurons and range R = 2, 
containing all possible observables, we have plotted the difference between 
the exact probability (known from (22) and the explicit form (16), (17) of 
transition probabilities and invariant probability), and the approached en- 
tropy h(ir^) obtained by replacing the exact probability P by the empirical 
probability h(7r^), as a function of raster size T. We have also plotted the 
finite corrections method proposed by Strong et al. in [68]. 
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Figure 1: Difference between the exact probability and the approached en- 
tropy h(7r^), as a function of raster size T. The potential of test includes 
all the possible observables where weights are set as random values. 



Now, if one wants to compare how two Gibbs distributions //i,/^2 a P~ 



5 The rate is given by the spectral gap of the transfer matrix: the difference between the 
largest eigenvalue (it is real and positive) , and the modulus of the second largest eigenvalue 
(in modulus). 
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proximate data, one compares the divergence cLkl ( Hi) , dxL {l^*\H2) 
where h(fJ,^*') is independent of the model choice. Therefore, the comparison 
of two models can be done without computing h(fj,^*'). 



2.4.2 Comparison of observables average 

Another criterion, easier to compute, is to compare the expected value of 
the observables average, /i^ known from (K24J) to the empirical aver- 



age [ Ok ] ■ Error bars are expected to follow the central limit theorem 
where fluctuations are given by eq. (25). Examples are given in Fig. [2j 
Note that the comparison of observables average is less discriminant than 
minimizing the Kullback-Leibler divergence, since there are infinitely many 
possible models matching the observables average. 



3 Monte-Carlo method for spatio-temporal Gibbs 
distribution 

3.1 The advantages and limits of transfer matrix method 

The advantage of the transfer matrix technique method is that it is math- 
ematically exact: given a potential "H^, it gives the Gibbs distribution and 
topological pressure without computing a partition function; given the para- 
metric form ^ where the parameters /3fcS has to be determined ("learned"), 
it provides the unique solution. On numerical grounds, this method provides 
an optimal estimation, in the limits of the error made when observing the 
observables empirically, this error being characterized by the central limit 
theorem. Its main drawback is that the transfer matrix £ has 2^ entries ! 
Although, most of those entries are zero (2^ non zero entries per row, thanks 
to the compatibility conditions) it is simply too huge to handle cases where 
NR > 24. 

Focusing thus on the huge number of states in the set of blocks, it is 
clear that any method requiring the extensive description of the phase space 
fails as NR grows. Additionally, while the accessible phase space is huge, the 
observed phase space (e.g. in an experimental raster) is rather small. Several 
strategies exist to avoid the extensive description of the phase space. Here, 
we propose an approach based on Monte-Carlo sampling. 

The idea is the following. Given a potential Hp we find a strategy to 
approximately compute the average fi [ Ok ] of observables Ok under the Gibbs 
distribution /i, using a statistical Monte-Carlo sampling of the phase space. 
For that purpose, the algorithm generates a raster following the statistics 
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defined by the potential and computes the observables on this artificial 
raster. Thanks to the estimation of the observables, the parameters of the 
model (/3fcs) can be found by modifying their values to minimize iteratively 
the distance between the values of the observables estimated on the real 
raster, and the values estimated with the Monte-Carlo sampling. Powerful 
algorithms exist for that, taking into account the uncertainty on the empirical 
averages ruled by the central limit theorem ([18] and [8J)). 



3.2 The Monte- Carlo-Hastings algorithm 

The Monte-Carlo-Hastings method consists in sampling a target probability 
distribution fi by constructing a Markov chain whose invariant probability is 
fj, [25]. The transition probability of this Markov chain, between two states 
u;W an cl u/ 2 ) is: 

P^)|^]^max( ^ (2) ^ (1 ^^ (i) J ;1) . (29) 

The function QQ can have different forms, allowing in particular to speed- 
up the convergence rate of the algorithm. Such specific forms are highly de- 
pendent on the form of and there is no general recipe to determine Q, 



given The contribution of Q cancels in (29) whenever Q is symmetric 
(Q(u\u)') = Q(u)'\u)). We make this assumption in the sequel. Practically, 
we take Q as the uniform distribution corresponding to flipping one spike at 
each iteration of the method. 



In classical Monte-Carlo approaches in statistical physics, the normaliza- 
tion factor of the Gibbs distribution, the partition function, cancels when 



computing the ratio of two blocks probabilities 



M U 



V2) 



(1) 



The situation is dif- 



ferent in the presence of spatio temporal constraints, as shown in eq. (19): 
"boundary terms" L (cOq^ 1 ), R ) remain. Actually, the same would 

hold in statistical physics problem with spatial interactions if one were to 
compare the probability of bulk spin-chains with distinct boundary condi- 
tions. 

This problem can however be circumvented thanks to the following re- 
marks: 

1. If one compares the probability of two blocks u;W,o/ 2 ' of range n > 



2D + 1, with u 



0-1,(1) 



u° 1,(2) andu^ +1 



UJ. 



n,(2) 
n-D+1 



then (19) reads: 





' ,n,(2)" 




' n,(l)~ 

u 
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with 



Thus, the Monte-Carlo transition probability (29) is only expressed as 
a difference of potential of the two blocks. 

2. AHpiujW, u®, 0, n) = £f =1 f3 k AO k (u^ ,0, n), with: 

n-D 



1=0 



Since the O k s are monomials, many terms O fc (^ +D )-0 fc (^ +D ) cancel. 
Assuming that we flip a spike at position (k,t), k e { 1, . . . , AT }, t e 
{ .D, n — .D }, we have indeed: 



l=t-D 



Since the difference O k (c^ D+ '' (2) ) - O k G {-1,0,1}, the 

computational cost of AO k {u^ l \ a/ 2 **, 0, n) is minimal if one makes a 
list of monomials affected by the flip of spike (k, r), r — 0, . . . D. 

3.3 Convergence rate 

The goal of Monte-Carlo-Hastings algorithm is to generate a sample of a 
target probability obtained by iteration of the Markov chain defined by eq. 
(29). In our case, this sample is a raster uJq~ 1 , distributed according to a 
Gibbs distribution //. Call Nfn p the number of iterations ("flips" in our case) 
of the Monte-Carlo algorithm. As Nfu p — > +oo the probability that the 
algorithm generates a raster uJq" 1 tends to /i [wj -1 ]. Equivalently, if one 
generates N see( i rasters and denote # (u;^ -1 ) the number of occurrences of 



a specific bloc u , then: 

r i- # ( u 1 ) r t-11 
hm hm — — = /i I oo I . 

N seed -¥+OQ N fHp -++oo N see d 

The convergence is typically exponential with a rate depending on 1-Lp. 

Now, the goal here is to use a Monte-Carlo raster to estimate fi [ O k ] 
by performing the empirical average itffl [O k \ on that raster. However, as 
explained in section |2.1.4 even if the raster is distributed according to fx 



(corresponding thus to taking the limit Nfu p — > +oo) the empirical average 
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[Ok] is not equal to /j, [Ofc], it converges to /i [ Of. ] as T — >■ +00, with 
an exponential rate (see footnote [5]) . More precisely, the probability that the 

difference itu [Of.] — n[Ok\ exceeds some e > behaves like exp(— T x 
1(e)) where /(e), called large-deviations rate, is the Legendre transform of 
the topological pressure j 13 j . 
When T is large we have: 

H [ j n<p [O k ] - p [O k ] j > e] ~ exp(^f ) (30) 

where cr(Ok) = [C?fc] (1 — /i [0fc]) is the mean-square deviations of O k . 

As a consequence, to obtain the exact average \i [ O k ] from our Monte- 
Carlo algorithm we would need to take the limits: 

lim lim lim ^ ^ -, (31) 

in that order: they do not commute. A prominent illustration of this point 
is illustrated in fig. |4j 

For notation homogeneity we note from now on T — 1 = Nu m es for the 
raster length. When dealing with numerical simulations with a finite number 
of sample, the goal is to minimize the probability that the error is bigger than 
a real number e, by suitable choice of: 

• The raster length: T — 1 = Nu mes . 

• The number of flips: Nfu p . 

• The number of seed: N see d- 

Let us now establish a few relations between those parameters. First, it 
is somewhat evident that Nfu p must be at least proportional to iV x N times 
in order to give a chance to all spikes in the raster to be flipped at least once. 



This criterion respects the order of limits in (31). 

Since fi is ergodic one can in principle estimate the average of observables 
by taking N seed = 1 and taking N times large. However, the larger N times the 
larger Nfu p and too big N times leads to too long simulations. On the opposite, 
one could generate a large number N see d of raster with a small N t i mes . This 
would have the advantage of reducing N/u p as well. However, the error 
(30) would then be too large. So, one needs to find a compromise: N t i mes 



large enough to have small Gaussian fluctuations (30) and small enough to 
limit Nfu p . Then, by increasing N seec i, one approaches the optimal bound on 



fluctuations given by (30). Additionally, this provides error bars. 
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4 Numerical tests 



In this section, performance in terms of convergence rate and CPU time for 
increasing values of iV (number of neurons) are discussed. First, we consider 
potentials ^ where the OkS are observables of the form Q, ("polynomial 
potentials") and we compare the Monte-Carlo results to those obtained using 
the transfer matrix and Perron - Frobenius theorem. As discussed in section 



3.1 the transfer matrix method becomes rapidly numerically intractable, so 
that the comparison between Monte-Carlo averages and exact averages can- 
not be done for large N. To circumvent this problem, we introduce, in section 
|4.2 a specific class of potentials for which the analytical computation of the 
topological pressure as well as observable averages can be analytically done, 
whatever N, R. This provides another series of tests. 



4.1 Polynomial potentials 

In this section, we present Monte-Carlo simulations with a potential of the 
form where the OkS are 100 observables randomly chosen among the 2^ 
possibilities. More precisely, we select randomly a fraction of "rates", a 
fraction js of pairwise terms and so on. 



4.1.1 Checking observables average 

The figure [2] shows the comparison between the exact values of the observable 
averages (ordinate) and the estimated Monte-Carlo values (abscissa). The 
error bars have been computed with N see( i samples. The tests were performed 
with N seed = 20, N times = 10000 and N flip = 100000. In this example, N 
goes from 3 to 8 and R = 3. For larger values of NR the numerical compu- 
tation with the transfer matrix method is not possible any more (NR = 24 
corresponds to matrices of size 16777216 x 16777216). 



4.1.2 Convergence rate 

In this section, we show how the Kulback-Leibler divergence varies as a func- 
tion of N times . The figure [3] shows the evolution of the Kulback-Leibler diver- 
gence between the real distribution and its estimation with the Monte-Carlo 
method. 

We also consider the error 



error = max 

k=l...K 



1 



n[o k ] 



(32) 



26 




Figure 3: Evolution of Kulback-Leibler divergence 28 as a function N t 



v times' 



As developed in section 3J3 this quantity is expected to converge to if 
Names +oo when Nfu p grows proportional to N x N times . For finite 
Names, Nfup — > +oo the error is controlled by the central limit theorem. The 



probabi 



ity that the error on the average of observable Ok is bigger than e 



(eq. (30)), behaves like exp 



-Mis 



<x(Ofc) 



On the opposite, if Nfn p stays constant while iV# mea grows, the error is 
expected to first decrease up to a minimum after which it increases. This is 
because the number of flips is insufficient to reach the equilibrium distribution 
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of the Monte-Carlo-Hastings Markov chain. This effect is presented in fig. |4} 
It shows the error (32) as a function of N times for N = 3 to N = 7 neurons 
with 3 different N flip values (1000, 10000, 100000). 

Clearly, the number of flips Nfu p should be at least more than iV x N t imes 
in order to give a chance to all spikes in the raster to be flipped at least once. 
A value N/u p = 10 x N x Names seems to be enough and computationally 
reasonable to perform the estimations. With an N see d = 20, we have results 
with a reasonable error around the mean values. 




Figure 4: Error as a function of Nu mes , for several values of Nju p (1000, 
10000, 100000). (left) N = 3; (right) N = 7. 
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4.1.3 CPU time 



Here we compare the CPU time for a Monte-Carlo simulation and the time for 
a computation with the transfer matrix. The figure [5] illustrates this. We have 
plotted the CPU time necessary to obtain the observables average presented 
in fig. [2j for the Monte-Carlo Average and for the exact average, as a function 
of N. The CPU time for Monte-Carlo increases slighty more than linearly 
while the CPU time for transfer matrix method increases exponentially fast: 
note that R = 3 here so that _Rlog2 = 2.08, close from the exponential rate 
found by fit: 2.24. 

le+06 

100000 

10000 

I 1000 
£ 

3 100 
1 

u 

10 

1 

0,1 

0,01 

2 4 8 

N 

Figure 5: The CPU time necessary to obtain the observable averages pre- 
sented in fig. [2j for the Monte-Carlo Average (MC Av.) and for the exact 
average (Th. Av.), as a function of N. The full lines correspond to fit. 

We also plot in fig. [6] the CPU times as a function of N times with three 
N fUp values (1000, 10000, 100000), for 3 and 7 neurons. The CPU time 
increases in a linear fashion with the Names value. The CPU time also in- 
creases linearly with the Nfn p value (for the same N value). The simulations 
have been done on a computer with the following characteristics: 7 Intel(R) 
Xeon(R) 3.20GHz processors with a 31.5 Gb RAM. 

The figure [7] compares the CPU time increase with N times for several TV 
values. It show that the CPU time increases linearly with Nu m es (as figure 
[6]). However, the slope increases with the number of neurons N. 
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Ntimes Ntimes 

Figure 6: The CPU time (T cpu ) as a function of N times . T cpu increases in a 
lnear fashion with N t ime as a x N + b where a is a function of N and b is a 
function of Nfu p . 




Ntimes 

Figure 7: The CPU time (T cpu ) as a function of Nu mes for several N values 
(Nflip= 10000). 

4.2 An analytically solvable example as a benchmark 

In this section we consider a specific example of potentials for which the 
topological pressure is analytically computable, whatever N, R. As a con- 
sequence the average of observables and fluctuations can also be computed. 
This example is obviously rather specific, but its main interests are to pro- 
vide a didactic illustration of thermodynamic formalism application as well 
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as a benchmark for numerical methods. 



4.2.1 Analytical setting 

We fix the number of neurons N and the range R and we choose L distinct 
pairs (i h ti), I = 1 . . . L, k G {1,...,JV}, U G { 0, . . . , D - 1 }. To this 
set is associated a set of K = 2 L events Ek = (u^iti), . . . , ^(t/) ) , k = 
0, . . . , K — 1. For example, if L — 2, there are 4 possible events Sq = (0, 0) 
: neuron i\ is not firing at time t\ and neuron i 2 is not firing at time t 2 ; 
Si = (0, 1) : neuron ii is not firing at time ti and neuron i 2 is firing at time 
t 2 ; and so on. It is convenient to have a label k corresponding to the binary 
code of the event. 

We define K observables Ok of range D taking binary values 0, 1. For a 
block ojq, Ok [(jJq ] = if the event is not realized in the bloc uJq and is 
1 otherwise. In the example above, Oq [uq ] = 1 if neuron i\ is not firing 
at time t\ and neuron i 2 is not firing at time t 2 in the block Uq. Thus, for 
N = 3, R = 4, (ii, ti) = (1, 0); (i 2 ; t 2 ) = (1,1)," 



while 



O 



O 



1 
10 1 
10 1 



10 1 
10 1 
10 1 



O 



O 



1 
110 1 
110 1 



10 1 
110 1 
110 1 



We finally define a potential H as in (jfjj), Hp = J^Li PkO 



k ■ 



For this type of potentials, whatever cu® l , ^ w D - ie M^i is a inde- 
pendent of u(D). As a consequence of the Perron- Frobenius theorem s@ 



V D-ie"^") and therefore: 



V{H I3 ) = {N-L) log(2)+log 



K 

fc=i 



,/3 fc 



As a consequence, from (24), giving the average of observable Ok as the 
derivative of V with respect to /3k- 



The fluctuations of observables can also be estimated as well. They are 



Gaussian with a covariance matrix given by the Hessian of V (see eq. (25)) 
and the central limit theorem. 



Remarks. 
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• An important assumption here is that observables do not depend on 
u(D). This important simplification as well as the specific form of 
observables makes the computation of V tractable. Note however that, 
although "H does not depend on u(D) as well, the normalized potential 
and therefore the conditional probability P [u(D) | Uq^ 1 ] depend on 
U)(D) thanks to the normalization factor Q and its dependence in the 
right eigenvector R of the Perron-Frobenius matrix. 



4.2.2 Numerical results for large scale networks 

Let us now use this example as a benchmark for our Monte-Carlo method. 
We have considered a case with range R = 4 and L = 6 pairs, corresponding 
to 2 6 = 64 terms in the potential. We have analyzed the convergence rate 
as a function of N the number of neurons and Nti mes , the time length of the 
Monte-Carlo raster. The number of flips, Nfu p is fixed to 10 x iV x N times , 
so that, on average, each spike of the Monte-Carlo raster is flipped 10 times 
in one trial. 



In fig. [8] (left) we have shown the relative error (eq. 32) as a function of 
Ntimes for several values of N. The empirical average rc^[Ok] is computed 
on 10 Monte-Carlo rasters. That's why we don't write the index u in the 
empirical probability. We stopped the simulation when the error is lower 

than 5%. As expected from the Central Limit Theorem (CLT), the error 

_ i 

decreases as a power law CN ti ^ es where the constant C has been obtained 
by fit. 

In fig. [8] (right) we have drawn the CPU time as a function of N times 
for several values of N. It increases linearly with N times , with a coefficient 
depending on N. The simulation is relatively fast: it takes 2/t30 for N = 60, 
Names = 8000 (with 10 Monte-Carlo trials) on a 7 processors machine (each 
of these processors has the following specifications: Intel(R) Xeon(R) CPU 
2.27GHz, 1.55 MB of each memory size) with a 17.72 GB RAM. 



5 Discussion and perspectives 

In this paper, we have shown how maximum entropy models can be extended 
to analyse the spatio-temporal dynamics of the neural activity. This raises 
specific issues, mainly related to the fact that the normalization of the Gibbs 
potential depends on the past activity. We have shown that transfer matrices 
results allow to handle this problem properly, providing additionally crucial 
information on statistics (especially average of observables and fluctuations 
of empirical averages). The challenge is then to be able to fit these models 
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Ntimes Ntimes 

Figure 8: (Left). Relative error as a function of N times for several values of 
N. (CLT) indicates the decay expected from the central limit theorem. As 

expected from the Central Limit Theorem , the error decreases as power law 

_ i 

^ times- (Right). CPU time as a function of Ntimes for several values of N. It 
increases linearly as CPU = a x Nu mes In the legend, next to the value of N 
we indicate the value of a. 



to the recordings. A major step in the fitting process is to compute the 
observables generated by the model for a given set of parameters. We have 
proposed a first method, based on the transfer matrix. It gives exact results, 
but can only be applied to small subsets of neurons. We have then designed a 
Monte-Carlo approach that overcomes this issue, confirmed by several tests. 

In fact, matching Gibbs averages of observables is only a first, although 
crucial, step towards spike train analysis of neuronal activity. The next step 
consists of fitting the parameters of a model from an experimental raster. 
Basically, this corresponds to minimizing the Kullback-Leibler divergence 
(28) between the model and the empirical measure. We have reviewed some 



possible techniques in the section 2.3.4 The application of our method to 



fit Gibbs distributions on large-scale retina recordings will be considered in 
a forthcoming paper. 



As a final issue we would now like to discuss shortcomings of Maximum 
entropy Models. Although initially proposed as powerful methods for neu- 
roscience applications, many future reports have cast doubt on how useful 
(spatial pairwise) Maximum Entropy models were. These criticism include 
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the role of common input [35J, the role of higher order correlations [67, l4"2] . 
scaling properties [52J and non stationarity j67]. As a matter of fact, the mod- 



els presented in the section 2.3.5 are clear and efficient alternatives to spatial 



pairwise maximum entropy models. Let us now comment these shortcomings. 

First, as we have developed, the maximum entropy approach is definitely 
not limited to spatial pairwise interactions. Especially, the role of higher or- 
der interactions beyond pairwise equal time ones, provides a clear motivation 
for including longer temporal history in statistical models of neural data. 

This raises however the question of how one chooses the potential. As 



exposed in section |2.2.1| the possible number of constraints is simply over- 
whelming and one has to make choices to reduce their number. These choices 
can be based on ad hoc assumptions (e.g. rates or instantaneous pairwise cor- 
relations are essential in neuronal coding) or on empirical constraints (type of 
cells, spatial localisation). However, this combinatorial complexity is clearly 
a source of troubles and questions about the real efficiency of the maxi- 
mum entropy problem. This problem is particularly salient when dealing 
with temporal interactions of increasing memory: even the number of possi- 
ble pairwise interactions might be too large to fit all of them on finite size 
recordings. Additionally, using a too complex potential increases the number 
of parameters necessary to fit the data beyond what the number of available 
samples allows. 

One solution is to try to infer the form of the potential from the data 
set. Important theorems in ergodic theory (50] as well as in Variable Length 
Markov Chains Estimations can be used [10J. This will be developed in a 
separate paper. 

It is however quite restrictive to stick to potentials expressed as linear 
combinations of observables, like This form has its roots in thermody- 
namics and statistical physics, but is far from being the most general form. 



Nonlinear potentials such as (27) (GLM) can be considered as well. Although 
such potentials can be expressed in the form ^ from the Hammersley- 
Clifford theorem [23], this representation induces a huge redundancy in the 
coefficients Examples are known of non linear potentials with relatively 
small number of parameters A/, which, expressed in the form ([6]), give rise 
to 2^ parameters /3fcS, all of them being functions of the A/s: see [TTj IT2"] . 
Such potentials constitute relevant alternatives to ^ where the formalism 
described here fully applies. 

More generally, alternatives to maximum entropy models consider differ- 
ent models trying to mimic the origin of the observed correlations. This is 
the case of the model proposed by when common inputs are added to 
account for the instantaneous correlations and the GLM model where the 
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numbers of parameters is only N 2 . However, note that in all these cases, the 
models constrain the correlations to be in a specific form, and might not be 
a good description of the activity either. Testing these models on data is the 
only way to distinguish the most relevant ones. Note that the discrepancy 
between model and data will probably be more and more obvious with a 
larger set of neurons. The validity of a model will also depend on size of the 
recorded population. 

Another, even deeper, question is the translation-invariance assumption 
intrinsic to the maximum entropy principle. When dealing e.g. to transient 
responses to temporary stimuli this assumption is clearly highly controver- 
sial. Note however that although the maximum entropy principle does not 
extend to non translation-invariant statistics, the concept of Gibbs distri- 
bution extend to that case |22j . Here, Gibbs distributions are constructed 
via transition probabilities, possibly with an infinite memory. Examples of 
applications to neuronal networks can be found in [TT| [12] . However, the 
application of this concept to analyzing real data, especially the problem of 
parameters estimation, remains to our knowledge an open challenge. 
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